Abstract:

This paper introduces a Corpus-Based Statistics-Oriented (CBSO) methodology, an attempt to avoid the drawbacks of both traditional rule-based approaches and purely statistical approaches. Rule-based approaches, with rules induced by human experts, had been the dominant paradigm in the natural language processing community. Such approaches, however, suffer from serious difficulties in knowledge acquisition in terms of cost and consistency, so such systems are very difficult to scale up. Statistical methods, which can acquire knowledge from corpora automatically, are becoming more and more popular, in part because they amend the shortcomings of rule-based approaches. However, most simple statistical models, which adopt almost nothing from existing linguistic knowledge, often result in a large parameter space and thus require an unaffordably large training corpus even for well-justified linguistic phenomena. The corpus-based statistics-oriented (CBSO) ...
Abstract: This paper introduces several corpus-based, statistics-oriented natural language processing techniques, including Shannon's noisy channel model and its applications, the n-gram model, methods for estimating and smoothing parameters, preference-based parsers, and so on. It also discusses how to use these techniques in Chinese language processing.
****** (Corpus Linguistics)*A+¥ft
**a*ftW-n*tttt**W#tt#*#*.-E 

*fttt»*ra« 6 at** it.**. 

M»**«**£ftft*\aA«*.*AAa# 
#r. S ****** «M*««*+Maj|.ff * 
****fffttiaaftft*?r*M**ft**»* 

-ft»**a«*ttita«a.A#a«*«,*- 
**T.A**±i*.******«*«Mu n « B <j*~ 

Attatti*fta#fc**SA. 

**194JMp, Warren Weaver ftttft. *TIH*MI

*Afettaaaa.ftA-*ttttM***ft*«l 
aa*aa.X+¥ft.**£.£ft7A«*a.tt: 

«*»t+B9at4.fi« Chomsky fc"*JiStt*T+Xt 
n SffHtatrnWttfF* Minsky fO Papert *" «
**fi«(Perceptrons)»+«#i6R»B4*rf.i6tt 

iS**.tt*tta*»HT-**W*JR.W»tt 
# «*«*** . £* *S«*«t* . *tt*iP«* 

««tt.***AMft*TO*ft*tt*tt«*vr 

*.tttt<E+^¥W*.-¥35"ilI« Brown 

*il**B*tt.aA*H*.a*T- i f;ffAtt 



Birmingham »**.4"*. ***#A*T**Vl 

fcSaa+fe*ftX**ft.-Hfc*tt.Ktftt«tt- 

**aa&ffttttta.a&.M*aft**aa«K 
«****ft.»tt**Nae**-+«a.»T 
**& nlp xft«»ia*«Att*sRxax* 
i>aMa»m*.aar*ffKAA«aa*a 
a-tt*tt**«a.jf*fc*HK.»*T«** 
**«rea±jt»fw^***BaNLp *««p» 
-+*A.*xa*A*«*Httflf*att-Aft 
ft.»*?*tt«ftaaftit*fta*£xftatt 

*#-+*»»**. 

******#+.»?*tt»ftaa**tt. 

****«« **W *»W*JlW»±»*R . e« 
R*»«*.i>««***t*#*-i9*A*.jjf* 
tt*tt(»Tttttaaaftft**>«ft**tt** 

iRaa.sr^iaaaaaTtaaM.**** 
a*«#».a#*a**a*-a«K**.** 
**aa*tttt»w.T**#**«-T*+» 
-«»*a*r»*». 

1>R*P<A> «3«-'r#*£A*,«ftA 

ftaaaaa. 

2)*ft«*P(AIC) **4*ftC*±|»* 

fFT.*ttAX445»ratt.«*B,ajg-i'f»stt 

« w,-e4i5**«P^*i»! n P(nfw). 
3>R$«*P(A.B> ***ttAfOBRBt« 



• )*3c»»aj8ft#**K*.ii s a±A.aa*a»aaaaa.«aaa.ttaa«a. 



*w*tmnxnsi *«**>+ .mm&* 

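\[ P(A_i \mid B) = \frac{P(B \mid A_i)\,P(A_i)}{\sum_{j=1}^{n} P(B \mid A_j)\,P(A_j)}, \quad i = 1, 2, \ldots, n \]

\[ P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid \bar{A})\,P(\bar{A})} = \frac{P(B \mid A)\,P(A)}{P(B)} \]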
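Bayes' formula turns priors P(A_i) and likelihoods P(B | A_i) into posteriors P(A_i | B), with P(B) obtained by summing over all hypotheses. A minimal sketch follows; the function name and the numbers are illustrative assumptions, not taken from the paper.

```python
def bayes_posterior(priors, likelihoods):
    """Return P(A_i | B) for each hypothesis A_i.

    priors[i] is P(A_i); likelihoods[i] is P(B | A_i).
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(B | A_i) P(A_i)
    evidence = sum(joint)                                 # P(B), total probability
    if evidence == 0:
        raise ValueError("P(B) is zero, so the posterior is undefined")
    return [j / evidence for j in joint]


# Hypothetical two-hypothesis case: A and not-A are equally likely a priori.
posterior = bayes_posterior([0.5, 0.5], [0.8, 0.2])
print(posterior)
```

The two-hypothesis call corresponds to the special case with A and its complement.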

7) Entropy: \( H = -\sum_{i} P(x_i) \log P(x_i) \)

8) Mutual information:

\[ I(x, y) = \log \frac{P(x, y)}{P(x)\,P(y)} \]

If I(x, y) >> 0, x and y are strongly associated; if I(x, y) ≈ 0, x and y are unrelated; if I(x, y) << 0, x and y are in complementary distribution. Mutual information is widely used to measure word association and word co-occurrence.
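Entropy and mutual information are both simple to compute once the probabilities are known. The sketch below assumes base-2 logarithms (the base is not recoverable from the text), and the function names are illustrative.

```python
import math


def entropy(probs):
    """H = -sum_i P(x_i) log2 P(x_i); zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def mutual_information(p_xy, p_x, p_y):
    """I(x, y) = log2( P(x, y) / (P(x) P(y)) ) for two events x and y."""
    return math.log2(p_xy / (p_x * p_y))


h_coin = entropy([0.5, 0.5])                        # fair coin
i_indep = mutual_information(0.25, 0.5, 0.5)        # P(x,y) = P(x)P(y)
i_assoc = mutual_information(0.5, 0.5, 0.5)         # x and y always co-occur
i_avoid = mutual_information(0.01, 0.5, 0.5)        # x and y rarely co-occur
print(h_coin, i_indep, i_assoc, i_avoid)
```

The three mutual information calls illustrate the three cases: independence gives zero, association a positive value, and avoidance a negative value.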

3-i *mt*«s! 

** at&t 5i*$.ifei«PAmaft»»-*i» ? f 

* (l )* A feat JS • U - * W iftiR ft * * (O)M. a - 
*»ta,fiP,i-*Ftta-o.-T6aiifi:&#;l- 



ft*-+tt«ttftftt o «P*Jt#»ftA i *r«« 
Pa|0>«»A**fr**lPJ 

\[ \hat{I} = \arg\max_{I} P(I \mid O) = \arg\max_{I} P(I)\,P(O \mid I) \]

P(D* 1 £»if W*A*afltW* 
*.«tB,ft»#iR*'p,'E*«HSA*Ul I lift*. 

fc*£±.fc«K**ft*ai^Hit.«(nft*w 
*«»^*#*a»#iwwfti*ttft*«£* 

fcitft* P(0|])*S I lii*«*A*lt O If 

««*ttftiii*di«fim*.*x«X£*att* 

ATM O. . Mtt*.«* 
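The noisy channel decision rule picks, for an observed output O, the input I that maximizes P(I) P(O | I). The sketch below applies it to a toy spelling-correction case; the candidate list and every probability in it are made-up assumptions for illustration only.

```python
def decode(observation, candidates, prior, channel):
    """Return the candidate input I maximizing P(I) * P(O | I).

    prior maps I to P(I); channel maps (O, I) to P(O | I).
    Missing entries are treated as probability zero.
    """
    return max(
        candidates,
        key=lambda i: prior.get(i, 0.0) * channel.get((observation, i), 0.0),
    )


# Hypothetical numbers: "the" is far more frequent than "ten", which
# outweighs the slightly higher channel probability of "ten" -> "teh".
prior = {"the": 0.05, "ten": 0.001}
channel = {("teh", "the"): 0.01, ("teh", "ten"): 0.02}
best = decode("teh", ["the", "ten"], prior, channel)
print(best)
```

The same function covers the other applications of the model by changing what I and O stand for, for example source and target sentences in translation.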

ft*«ft^gJ8ra«.ft*P, + 

*ft«IM» .-f.rm-««*nrit*-form-»«aj . 
(OCR)ft g a*34»t«**tt1RJN&a«*.ft 

X-P.w ft-**ft*?tt.#?**lRjMRfl,Y 
OCR + .Y ASfrftWttttffl 

«r*,ftft»j?tt* iej*+,y ft*-*pm*«« 

*A^flP».i»#,(BI«i»B»*fitti^JHtit#« 
-^#«*^ff*^,tf,^-«Bn««P(W)P(Y! 
W). 

*ia c u«ft**ftttAa.*£*Ts«4ft* 

r»^ft*-w.ft tiujc **4 w Miff 

«Tft e c.*# pretty** ad nr4T*»m . 

\[ \hat{C} = \arg\max_{C} P(C)\,P(W \mid C) \]

x*iP(c>ftP(wio*uflAjijm«*ttx* 

«P*fl#»{!fTi-»Htt-«a»it«* Pfc.lc.-2a- 

i>ft-«HEft* P(w > re.>jtrrNibitjtftftM. 
(3m#ft*.m»«*(MT)iFKR*is^ 




«#^a»»ffT-*if?Itt**.W«Trr(1949) 
*-**dST-#^ MTttfrJ.fcttatffc.I,* 

+fcfT7*«.***Ji*Aff**iS SYSTRAN 
**.ftiS ,MT »*»^»«If)!«(t)iP*ffl«tt 

IBM M P. F. Brown *A«»«I1Mt-*S: 
*T Weaver # MT»AlfrttjMI*tt.*fl*iHr 

\[ \hat{E} = \arg\max_{E} P(E)\,P(F \mid E) \]

&tt**£tt*«***afeiu 

PCFIE),!!!^^*-^****^*^* 

«»*AY*WK*tttf** 

*MH« .*fr«**«T- 

ffl»Fffif«naiiqwjta.w-ii^«ra-E.-» 
r* w «£t*atJg. e 

\[ \hat{W} = \arg\max_{W} P(W)\,P(E \mid W) \]

»*. 

4 ait«&*«*o#ftttit 

f»iti-ir^*iR* p(Dfp«ji«* pco!i)*if^w 
**+tttm*tirtft!M*tttt«att*ft.T
»*«»**»*ai«fn#*ttttW#/8;!rfc. 
4-1 ftitfliafttt* 

\[ P(W) = P(w_1 w_2 \cdots w_n) = \prod_{i=1}^{n} P(w_i \mid w_1 \cdots w_{i-1}) \]

-.w.-,tt*ftT.a*« w. n- 
«a > #fij»irflri».ffli*T-*n*ir*«. 

With n = 2 and n = 3 this gives the bigram model, \( P(w_i \mid w_{i-1}) \), and the trigram model, \( P(w_i \mid w_{i-2} w_{i-1}) \).
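A bigram model approximates the probability of a sentence by a product of P(w_i | w_{i-1}) terms. The sketch below estimates those terms from relative frequencies in a tiny made-up corpus; the `<s>` start symbol, the function names, and the corpus are assumptions for illustration.

```python
from collections import Counter


def train_bigram(sentences):
    """Count histories and bigrams, with <s> marking the sentence start."""
    history, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        history.update(tokens[:-1])               # each token used as a history
        bigrams.update(zip(tokens, tokens[1:]))   # (previous word, word) pairs
    return history, bigrams


def sentence_prob(sentence, history, bigrams):
    """P(W) as the product of P(w_i | w_{i-1}) = f(w_{i-1}, w_i) / f(w_{i-1})."""
    tokens = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if history[prev] == 0:
            return 0.0                            # history never seen
        prob *= bigrams[(prev, word)] / history[prev]
    return prob


corpus = ["John loves Mary", "John loves Jean", "Mary loves John"]
history, bigrams = train_bigram(corpus)
p_seen = sentence_prob("John loves Mary", history, bigrams)
p_unseen = sentence_prob("Jean loves Mary", history, bigrams)
print(p_seen, p_unseen)
```

The second sentence gets probability zero only because one of its bigrams never occurs in the corpus, which is the sparse-data problem that smoothing addresses.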

«a«ra«.a#^i5w*ttTtjn»a.Taiiaw 

1-**tt*«iftW-Tl)iaW«5t^tt. 

ilM1k#&.*t : ?+m+ W-w,.w„".,w. # 

w** ie* c-c, ,e, , ». ,c .tmt*+n5m** a 
*a. 

\[ P(W \mid C) = P(w_1 w_2 \cdots w_n \mid c_1 c_2 \cdots c_n) = \prod_{i=1}^{n} P(w_i \mid c_i) \]
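Reading C as a sequence of word categories (part-of-speech tags), the product of P(w_i | c_i) terms can be combined with a prior P(C) to pick the best category sequence. The sketch below assumes a bigram model for P(C), which is a common choice but an assumption here, and searches all sequences by brute force. All probabilities are invented toy values.

```python
from itertools import product


def best_tags(words, tags, trans, emit):
    """Return the tag sequence C maximizing P(C) * P(W | C), and its score.

    trans maps (previous tag, tag) to P(c_i | c_{i-1}), with "<s>" as start.
    emit maps (word, tag) to P(w_i | c_i). Missing entries count as zero.
    """
    best, best_score = None, -1.0
    for seq in product(tags, repeat=len(words)):  # exhaustive search
        score, prev = 1.0, "<s>"
        for word, tag in zip(words, seq):
            score *= trans.get((prev, tag), 0.0) * emit.get((word, tag), 0.0)
            prev = tag
        if score > best_score:
            best, best_score = seq, score
    return best, best_score


# Hypothetical toy tables with two tags: N (noun) and V (verb).
trans = {("<s>", "N"): 0.8, ("<s>", "V"): 0.2,
         ("N", "N"): 0.3, ("N", "V"): 0.7,
         ("V", "N"): 0.6, ("V", "V"): 0.4}
emit = {("time", "N"): 0.6, ("time", "V"): 0.1,
        ("flies", "N"): 0.2, ("flies", "V"): 0.5}
tags_out, score = best_tags(["time", "flies"], ["N", "V"], trans, emit)
print(tags_out, score)
```

Exhaustive search is exponential in sentence length; a practical tagger would use dynamic programming instead, but the scoring is the same.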

*.»M *« , *«-*J*»*J* SE = we, .we, , •« , 
we. we, »**ife**J? SF-wf,.wf„--. 

wf. wf„-*Br«*JBTH HfMT*. 

>X^(lraD»l«tion).jr4««)^JtUe«n airoe 
Marie I John loves Mary) John Jean, loves 

A aime. ffi Mary Marie. 

»ot ## »e-p.. * 

■TWiH* . f-o. 

e)«JP(di»tortion).iT»ff*ffl^«»^ra. 

uta. 

\[ P(SF \mid SE) = \prod_{i} \Big[ P(f_i \mid we_i) \prod_{j} P(wf_j \mid we_i) \Big] \cdot \prod_{i,j} P(i \mid j) \]

K «p P(f, f we,) ,P(wf, f we,)*B P(« 



a>***a*&fttt(MLE>.««-T*« w « 
maw** p(w)ff#-*^«#.jua 
***** n ja^^rBt ,ttf]?rw«ia** w 

\[ P(w) = \frac{f(w)}{N} \]
ft* f(w)*#iil w «»«4«f difltt»K.iSftft 

mle fttt^ir*. 
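The maximum likelihood estimate of a word's probability is its relative frequency, f(w)/N. A minimal sketch on a made-up token list follows; the function name and the data are illustrative assumptions.

```python
from collections import Counter


def mle_unigram(tokens):
    """Return P(w) = f(w) / N for every word type seen in tokens."""
    counts = Counter(tokens)   # f(w)
    n = len(tokens)            # N, the total number of tokens
    return {w: c / n for w, c in counts.items()}


tokens = "the cat saw the dog".split()
probs = mle_unigram(tokens)
p_the = probs["the"]
p_bird = probs.get("bird", 0.0)   # unseen word: the MLE assigns zero
print(p_the, p_bird)
```

Any word absent from the training data receives probability zero, which is why the raw estimate is usually smoothed.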

att#fc«*a*«.4Wp*aaT**#» 
it8*atttttt.&*.£*«7ttaff**£«a 

a **+«*«7**tf #r a 

iil*(»a«.tfifc.MLE *Sttt»tt*a*ttttit 

*J8 .ft*»^Ka*i»#*ttir;!fifciigi!l** . 



Table 1. Number of n-grams by frequency of occurrence

Count   1-grams    2-grams          3-grams
1       36,789     8,045,024        53,737,350
2       20,289     2,065,469        9,228,958
3       13,123     970,434          3,653,791
>3      135,335    3,413,290        8,728,789
>0      205,516    14,494,217       75,349,888
≥0      260,741    6.799×10^10      1.773×10^16

Table 1 gives the counts of 1-grams, 2-grams, and 3-grams in an English text of 356,893,263 words with a vocabulary of 260,740 words. Of the 6.799×10^10 possible 2-grams over this vocabulary, only 14,494,217 actually occur in the text, and 8,045,024 of those occur only once. Similarly, of the 1.773×10^16 possible 3-grams, only 75,349,888 actually occur, and 53,737,350 of those occur only once.
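The sparseness shown by these counts can be checked with a few lines of arithmetic. The sketch below uses the figures as printed in Table 1, taking 260,741 (the last 1-gram entry) as the number of word types; the variable names are illustrative.

```python
VOCAB = 260_741                      # last 1-gram entry of Table 1

possible_2 = VOCAB ** 2              # all conceivable 2-grams
possible_3 = VOCAB ** 3              # all conceivable 3-grams

seen_2, once_2 = 14_494_217, 8_045_024
seen_3, once_3 = 75_349_888, 53_737_350

coverage_2 = seen_2 / possible_2     # share of possible 2-grams observed
singleton_2 = once_2 / seen_2        # share of observed 2-grams seen once
singleton_3 = once_3 / seen_3        # share of observed 3-grams seen once

print(f"{possible_2:.3e} {possible_3:.3e}")
print(f"{coverage_2:.2e} {singleton_2:.3f} {singleton_3:.3f}")
```

More than half of the observed 2-grams and roughly seven in ten of the observed 3-grams occur only once, so relative-frequency estimates for them rest on a single observation.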

ik4s«ti?ri»*Httiraffft*xttRMar E 



A».M» n l&i**.n-gram HatfftMM* 

******.fl**««£*e*«RIJ,**tt 

wl-»>?r*T**tB. 

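\[ P(w_i \mid w_{i-n+1}^{i-1}) = \sum_{j} \lambda_j(w_{i-j+1}^{i-1})\, P^{(j)}(w_i \mid w_{i-j+1}^{i-1}) \]

where \( P^{(j)}(w_i \mid w_{i-j+1}^{i-1}) \) is the j-gram estimate and the weights \( \lambda_j \), which can be estimated with the EM algorithm, satisfy \( \sum_j \lambda_j(w_{i-j+1}^{i-1}) = 1 \). Combining the 1-, 2-, and 3-gram estimates gives

\[ P(w_i \mid w_{i-2} w_{i-1}) = \lambda_1 P(w_i) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_3 P(w_i \mid w_{i-2} w_{i-1}) \]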
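Linear interpolation mixes the unigram, bigram, and trigram estimates with weights that sum to one, so that an unseen trigram still receives some probability. The sketch below uses fixed, hand-picked weights as an assumption; in practice the weights would be estimated from held-out data.

```python
def interpolate(p_uni, p_bi, p_tri, lambdas):
    """Return l1*P(w) + l2*P(w | w_prev) + l3*P(w | w_prev2 w_prev)."""
    l1, l2, l3 = lambdas
    if abs(l1 + l2 + l3 - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    return l1 * p_uni + l2 * p_bi + l3 * p_tri


# Hypothetical estimates for one word: the trigram was never observed.
smoothed = interpolate(p_uni=0.01, p_bi=0.2, p_tri=0.0,
                       lambdas=(0.1, 0.3, 0.6))
print(smoothed)
```

Here the trigram term contributes nothing, yet the result is nonzero because the lower-order estimates back it up.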
Ney H- [l994]*frfflT-«H***tt#«fc*ffl[ 

w«sttt#*fts 
«*+tusiw**.W]SK»«*N«.ia*#*ft 

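For an n-gram that occurs r times in a sample of size N, the MLE is p = r/N. Smoothing replaces the observed count r with an adjusted count r*, so that

\[ P = \frac{r^*}{N}, \qquad N = \sum_{r} r\,N_r \]

where \( N_r \) is the number of distinct n-grams that occur exactly r times and N is the sample size. One way to obtain r* is the Good-Turing method, in which \( r^* = (r+1)\,N_{r+1} / N_r \); others are held-out estimation and deleted estimation. See Church K. and Gale W. (1991)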
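The Good-Turing adjusted count for an item seen r times is (r+1) N_{r+1} / N_r, where N_r is the number of distinct items seen exactly r times. The sketch below applies it to the 2-gram column of Table 1 as printed; the function name is an illustrative assumption.

```python
def good_turing_count(r, n_r):
    """Return r* = (r + 1) * N_{r+1} / N_r.

    n_r maps a frequency r to N_r, the number of distinct items seen r times.
    """
    if n_r.get(r, 0) == 0 or n_r.get(r + 1, 0) == 0:
        raise ValueError("N_r and N_{r+1} must both be nonzero")
    return (r + 1) * n_r[r + 1] / n_r[r]


# Frequencies of frequencies for 2-grams, copied from Table 1.
bigram_n_r = {1: 8_045_024, 2: 2_065_469, 3: 970_434}
r_star_1 = good_turing_count(1, bigram_n_r)
r_star_2 = good_turing_count(2, bigram_n_r)
print(round(r_star_1, 3), round(r_star_2, 3))
```

Both adjusted counts fall below the raw counts, which frees probability mass for n-grams that were never observed.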

Afttfifctt Atattftiiff T if « i&#*r * 

«. 

*#aMt-1*«*tt*« 0(fMCl/N).b)«Jg*£ 

«»***»*««»#*»*«-*..».«£« 
»«*(ap«Ako**i/N. 



* * .i^*«*#*fll#iS, SE (collocation) * 
(co-oceorr e nce)#«*(l«i,)*Jta«^J-a S |^pj 
BHM.- y h*aBi«^Jt itrong ft powerful. Halli- 
d.y ([I966])a*«§t strong ft powerful 

itti» 6jaft«A.a*#tt*ft a H*j4fl»^w 

«*<$n.«rong tea ft powerful computer). -OS* 
butler ft doctor/nurse.6*$r**«*!.Jtp|+Jf 

fc.ta*fc*iaaft*» nlp a*.tf m*st« 

-t* tH-wawamt-** 

JfetSA (mutual information)itJ(EKS. 

#««*fty.«**AK«iyJltgl*TP»^* 

fl« MLE tffctttf- PCx) .P(y) , bI W#» , 

aw**** p<«.y).jci*rua&i(tx-+* 
** w i^fta** □ a ***** 

x.y **a.«fr *!*«««*« 

f(x.y)*DaW»Tt-.ip, 

Ak**r*f*ftit**««i«;ife«rA,in,*«ft* 

«|B|*««»jt* * (doctor/nune) .** Hft fcifl 
ffl*»tf J2* (potent medicine J? .trong cur- 
rency),*«ft*«ttBl«»B:(uke a decision iff* 
JB m.ke,pay attention IB*". gWe)*.fc£«ftXt 

?atn mis xftti ft a *»«•«* jr . 

Hindle ft Rooth ftff **aw7#**it»* 
«ftK#*tM*tt*ft±MfH!.**8#-'* 

^tS^J^FiSheCwanted 'placed ! put) the dress on the 
rack. Xt?*H»aiiiJ.:fri*1*)5(on the rack)«i£ 
»^TI<0ft5fC-flr.'EBrWm**«(Jt wanted) .f&BT 
K ***»***fiSK* Placed .put) .****« 
6]^^rff «P*tffl*«^«fi»*«(PP attach- 
ment)(6J«.Hi»dle ft Rooth ttW*E*«(!.~+#*f 



stftX«^MCd M ~o.>nMtt«MrAflUtfiHs 

tt***fttttt4r*f**. Magerman 

«***tt* a 

T«ff»»*.-ae*«|»iFJtafiJS Brent M. 
(1993),KobayaJu Y. *<1994>AjIflf. 

*r*±,*^«litttifett^*fftitT*Mlr54r 

«a*ftirtt».H*t3i»i» nlp use**** 

A stochastic context-free grammar (SCFG) attaches a probability P(A → BC) to each CFG rule A → BC, and the Inside-Outside algorithm can be used to estimate the parameters of an SCFG.

CFG Wft*. 5J 

*L*J.Bricoe T. *(1993).T.panainen P. 9f (1994). 
Magerman a *(1990)**NftftTtffi**ti#«r 

*X»»*^fflT»^*»r**M|t!«tit#» 
e.«*Wtt«**«»*r1«.ii«H*Bil*» 

fn\#*ftfWttft#tt*Wff(Expl. Mt io» Data 
Analysis. EDA).«etttt*«mA«*t>»£M*. 

ING.ACL.TMI 4^*JB»*i**|fi8SF»Tlft*£ii 
VF^IF ft*ft.*. 3 ttilttttit. g *4HHfft.* 

ft***. 

*a-^*« mm . && tax unto a*** 

«.«tt«i£.»iaa»#ffft*PSt.*A«it#» 
ctfiffi«i*»»iR*.A-pBrw«t*st*«»« 

*«#€B*Wfll?tlfr5fll.(#*^«^7t!*) 